This project consists of two files.
To use this project, you should install Jupyter Notebook with a Python environment containing the appropriate libraries.
While Jupyter runs code in many programming languages, Python is a requirement (Python 3.3 or greater, or Python 2.7) for installing the Jupyter Notebook. In my project I used Python 3.6. I recommend using the Anaconda distribution to install Python and Jupyter. We’ll go through its installation in the next section.
For new users, it is highly recommended installing Anaconda. Anaconda conveniently installs Python, the Jupyter Notebook, and other commonly used packages for scientific computing and data science.
You should install the libraries used in this project via Anaconda → Environments.
Here you can see an example for the scikit-learn library.
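The same installation can also be done from the command line. A minimal sketch, assuming the Anaconda `conda` tool is on your PATH (package names are the usual Anaconda ones):

```shell
# Install the libraries used in this project into the active conda environment
conda install numpy pandas matplotlib seaborn scikit-learn
```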

After you have installed the Jupyter Notebook on your computer, you are ready to run the notebook server. You can start the notebook server from the command line (using Terminal on Mac/Linux, Command Prompt on Windows) by running:
jupyter notebook
This will print some information about the notebook server in your terminal, including the URL of the web application (by default, http://localhost:8888):
[I 08:58:24.417 NotebookApp] Serving notebooks from local directory: /Users/alexander.leontev
[I 08:58:24.417 NotebookApp] 0 active kernels
[I 08:58:24.417 NotebookApp] The Jupyter Notebook is running at: http://localhost:8888/
[I 08:58:24.417 NotebookApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).

It will then open your default web browser to this URL. When the notebook opens in your browser, you will see the Notebook Dashboard, which will show a list of the notebooks, files, and subdirectories in the directory where the notebook server was started. Most of the time, you will wish to start a notebook server in the highest level directory containing notebooks. Often this will be your home directory.

Now you can use this project: open the file project_heart_disease.ipynb and run the code to explore the data.
https://archive.ics.uci.edu/ml/datasets/Heart+Disease
This database contains 76 attributes, but all published experiments refer to using a subset of 14 of them. In particular, the Cleveland database is the only one that has been used by ML researchers to this date. The "goal" field refers to the presence of heart disease in the patient. It is integer valued from 0 (no presence) to 4. Experiments with the Cleveland database have concentrated on simply attempting to distinguish presence (values 1,2,3,4) from absence (value 0).
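As a minimal sketch of this presence/absence split (the values below are illustrative, not taken from the dataset), collapsing the 0–4 goal field to a binary label looks like:

```python
import pandas as pd

# Illustrative values only: the UCI goal field is integer-valued from 0 to 4
goal = pd.Series([0, 2, 1, 0, 4, 3, 0])

# Presence (values 1, 2, 3, 4) vs. absence (value 0), as in the published experiments
presence = (goal > 0).astype(int)
print(presence.tolist())  # → [0, 1, 1, 0, 1, 1, 0]
```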
Firstly, we should import all the libraries that we will use in our application. All necessary Python modules imports are shown below:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer  # Imputer was replaced by SimpleImputer in newer scikit-learn versions
from sklearn.model_selection import GridSearchCV,train_test_split,cross_val_score
from sklearn.metrics import classification_report,confusion_matrix
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.metrics import roc_curve, auc
import os
import warnings
warnings.filterwarnings('ignore')
Let's load our dataset into the data variable using the read_csv function from the pandas library.
print(os.listdir("./input"))
data = pd.read_csv('./input/heart.csv')
data = data.sample(frac=1)  # shuffle the rows
# Now our data is loaded. The following snippet shows the first five rows of the loaded data.
print('Data First 5 Rows Show\n')
data.head()
print('Data Last 5 Rows Show\n')
data.tail()
Both the head() and tail() functions return 5 rows by default. Pass a different number as a parameter to change this.
# How many rows and columns are there for all data?
print('Data Shape Show\n')
data.shape # (rows, columns)
print('Data Show Info\n')
data.info()
# Now we will check the data for null values and sum them up, to see how many values are missing.
print('Data Sum of Null Values \n')
data.isnull().sum()
# All rows control for null values
data.isnull().values.any()
So there is no missing data in the dataset.
print('Data Show Describe\n')
data.describe()
The features described in the above data set are:
Count tells us the number of non-empty rows in a feature.
Mean tells us the mean value of that feature.
Std tells us the standard deviation of that feature.
Min tells us the minimum value of that feature.
25%, 50%, and 75% are the percentiles/quartiles of each feature.
Max tells us the maximum value of that feature.
The target is the output column, which shows whether the patient had heart disease or not. Observing the target values, we can see that our data contains more people with heart disease than people without it.
fig, ax = plt.subplots(figsize=(5, 8))
sns.countplot(data['target'])
plt.title('Target values')
Now let's see how many males and females are in our data.
male =len(data[data['sex'] == 1])
female = len(data[data['sex']== 0])
plt.figure(figsize=(8,6))
# Data to plot
labels = 'Male', 'Female'
sizes = [male, female]
colors = ['skyblue', 'yellowgreen']
explode = (0, 0) # no slice is exploded
# Plot
plt.pie(sizes, explode=explode, labels=labels, colors=colors,
autopct='%1.1f%%', shadow=True, startangle=90)
plt.axis('equal')
plt.show()
Fbs: (fasting blood sugar > 120 mg/dl) (1 = true; 0 = false)
plt.figure(figsize=(8,6))
# Data to plot
labels = 'fasting blood sugar < 120 mg/dl','fasting blood sugar > 120 mg/dl'
sizes = [len(data[data['fbs'] == 0]), len(data[data['fbs'] == 1])]
colors = ['skyblue', 'yellowgreen']
explode = (0.1, 0) # explode 1st slice
# Plot
plt.pie(sizes, explode=explode, labels=labels, colors=colors,
autopct='%1.1f%%', shadow=True, startangle=180)
plt.axis('equal')
plt.show()
Below, the distribution of age over the dataset is shown.
sns.distplot(data['age'])
Here we observe a bar plot, with the belief that different types of chest pain lead to different targets in males and females. We can see that cp type 4 contributes more in females, while chest pain type 2 contributes more in males.
sns.set(font_scale=2)
fig, ax = plt.subplots(figsize=(20, 10))
plt.title('Target values with different cp in male and female')
sns.barplot(x='sex', y='target', hue='cp', data=data)
Below I have shown a plot that bins the age feature into ranges to observe the effect of exercise on the maximum heart rate achieved in different age groups. We can see that the maximum heart rate decreases as age increases, and it decreases further when the patient has exercise-induced angina. So a lower maximum achieved heart rate is associated with exercise-induced angina.
bins = [29, 35, 45, 55, 65, 77]
labels = ['29-35', '36-45', '46-55', '56-65', '66-77']
data['agerange'] = pd.cut(data.age, bins, labels = labels, include_lowest = True)
fig, ax = plt.subplots(figsize=(20, 10))
plt.title('Maximum heart rate by age range, with and without exercise-induced angina')
sns.barplot(x = 'agerange', y = 'thalach', hue = 'exang', data = data)
data = data.drop(["agerange"] , axis=1)
Thalach: maximum heart rate achieved
sns.distplot(data['thalach'], kde = False, bins = 30, color = 'violet')
Chol: serum cholestoral in mg/dl
sns.distplot(data['chol'], kde = False, bins = 30, color = 'red')
plt.show()
Trestbps: resting blood pressure (in mm Hg on admission to the hospital)
sns.distplot(data['trestbps'], kde = False, bins = 30, color = 'blue')
plt.show()
Number of people who have heart disease according to age
plt.figure(figsize = (30, 12))
sns.countplot(x = 'age', data = data, hue = 'target', palette = 'GnBu')
plt.show()
Below I have shown a correlation matrix to check whether there is any correlation between different features. We can see that there is no serious correlation between the features.
Correlation coefficients are used in statistics to measure how strong a relationship is between two variables. There are several types of correlation coefficient: Pearson’s correlation (also called Pearson’s R) is a correlation coefficient commonly used in linear regression.
Pearson's correlation
The Pearson product-moment correlation coefficient (or Pearson correlation coefficient, for short) is a measure of the strength of a linear association between two variables and is denoted by r. Basically, a Pearson product-moment correlation attempts to draw a line of best fit through the data of two variables, and the Pearson correlation coefficient, r, indicates how far away all these data points are to this line of best fit (i.e., how well the data points fit this new model/line of best fit).
Pearson r correlation is the most widely used correlation statistic to measure the degree of the relationship between linearly related variables. For example, in the stock market, if we want to measure how two stocks are related to each other, Pearson r correlation is used to measure the degree of relationship between the two. The point-biserial correlation is conducted with the Pearson correlation formula except that one of the variables is dichotomous. The following formula is used to calculate the Pearson r correlation:

The Pearson correlation coefficient, r, can take a range of values from +1 to -1. A value of 0 indicates that there is no association between the two variables. A value greater than 0 indicates a positive association; that is, as the value of one variable increases, so does the value of the other variable. A value less than 0 indicates a negative association; that is, as the value of one variable increases, the value of the other variable decreases. This is shown in the diagram below:

The stronger the association of the two variables, the closer the Pearson correlation coefficient, r, will be to either +1 or -1 depending on whether the relationship is positive or negative, respectively. Achieving a value of +1 or -1 means that all your data points are included on the line of best fit – there are no data points that show any variation away from this line. Values for r between +1 and -1 (for example, r = 0.8 or -0.4) indicate that there is variation around the line of best fit. The closer the value of r to 0 the greater the variation around the line of best fit. Different relationships and their correlation coefficients are shown in the diagram below:
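As a quick illustration with made-up numbers (not from the dataset), NumPy's corrcoef computes Pearson's r:

```python
import numpy as np

# Illustrative data: y rises with x, z falls with x
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.0, 9.8])
z = np.array([5.0, 4.1, 3.2, 1.9, 1.1])

r_pos = np.corrcoef(x, y)[0, 1]  # close to +1: strong positive association
r_neg = np.corrcoef(x, z)[0, 1]  # close to -1: strong negative association
print(round(r_pos, 3), round(r_neg, 3))
```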

plt.figure(figsize=(40, 20))
sns.heatmap(data.corr(method="pearson"), annot = True, cmap='coolwarm', linewidths=.1)
plt.show()
Spearman's correlation
Spearman rank correlation is a non-parametric test that is used to measure the degree of association between two variables. The Spearman rank correlation test does not carry any assumptions about the distribution of the data and is the appropriate correlation analysis when the variables are measured on a scale that is at least ordinal.
The following formula is used to calculate the Spearman rank correlation:
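The formula can be checked by hand: Spearman's rho is Pearson's r computed on ranks, and for untied data it equals 1 − 6·Σd² / (n(n² − 1)). A small sketch with illustrative numbers:

```python
import numpy as np

x = np.array([10.0, 20.0, 30.0, 40.0, 50.0])
y = np.array([1.0, 4.0, 9.0, 16.0, 25.0])  # monotone but nonlinear in x

# Ranks (1-based); d is the per-observation rank difference
rank_x = x.argsort().argsort() + 1
rank_y = y.argsort().argsort() + 1
d = rank_x - rank_y
n = len(x)
rho = 1 - 6 * np.sum(d ** 2) / (n * (n ** 2 - 1))
print(rho)  # → 1.0: a perfectly monotone relationship
```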

plt.figure(figsize=(40, 20))
sns.heatmap(data.corr(method="spearman"), annot = True, cmap='coolwarm', linewidths=.1)
plt.show()
Kendall's correlation
In statistics, the Kendall rank correlation coefficient, commonly referred to as Kendall's tau coefficient (after the Greek letter τ), is a statistic used to measure the ordinal association between two measured quantities. A tau test is a non-parametric hypothesis test for statistical dependence based on the tau coefficient.
It is a measure of rank correlation: the similarity of the orderings of the data when ranked by each of the quantities. It is named after Maurice Kendall, who developed it in 1938, though Gustav Fechner had proposed a similar measure in the context of time series in 1897.
Intuitively, the Kendall correlation between two variables will be high when observations have a similar (or identical for a correlation of 1) rank (i.e. relative position label of the observations within the variable: 1st, 2nd, 3rd, etc.) between the two variables, and low when observations have a dissimilar (or fully different for a correlation of −1) rank between the two variables.
Both Kendall's and Spearman's can be formulated as special cases of a more general correlation coefficient.
Kendall rank correlation is a non-parametric test that measures the strength of dependence between two variables. If we consider two samples, a and b, where each sample size is n, we know that the total number of pairings with a b is n(n-1)/2. The following formula is used to calculate the value of Kendall rank correlation:
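The pair-counting definition can be sketched directly. With illustrative untied samples, tau is (concordant − discordant) divided by the n(n−1)/2 total pairs:

```python
from itertools import combinations

# Illustrative untied samples a and b
a = [1, 2, 3, 4, 5]
b = [2, 1, 4, 3, 5]

pairs = list(combinations(range(len(a)), 2))  # n*(n-1)/2 = 10 pairs
concordant = sum((a[i] - a[j]) * (b[i] - b[j]) > 0 for i, j in pairs)
discordant = sum((a[i] - a[j]) * (b[i] - b[j]) < 0 for i, j in pairs)
tau = (concordant - discordant) / (len(a) * (len(a) - 1) / 2)
print(tau)  # → 0.6: (8 concordant - 2 discordant) / 10
```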

plt.figure(figsize=(40, 20))
sns.heatmap(data.corr(method="kendall"), annot = True, cmap='coolwarm', linewidths=.1)
plt.show()
As a result, we can see that there is no serious correlation between the features.
In machine learning and statistics, classification is the problem of identifying to which of a set of categories (sub-populations) a new observation belongs, on the basis of a training set of data containing observations (or instances) whose category membership is known. Examples are assigning a given email to the "spam" or "non-spam" class, and assigning a diagnosis to a given patient based on observed characteristics of the patient (sex, blood pressure, presence or absence of certain symptoms, etc.). Classification is an example of pattern recognition. In the terminology of machine learning, classification is considered an instance of supervised learning, i.e., learning where a training set of correctly identified observations is available. The corresponding unsupervised procedure is known as clustering, and involves grouping data into categories based on some measure of inherent similarity or distance. Often, the individual observations are analyzed into a set of quantifiable properties, known variously as explanatory variables or features. These properties may variously be categorical (e.g. "A", "B", "AB" or "O", for blood type), ordinal (e.g. "large", "medium" or "small"), integer-valued (e.g. the number of occurrences of a particular word in an email) or real-valued (e.g. a measurement of blood pressure). Other classifiers work by comparing observations to previous observations by means of a similarity or distance function. An algorithm that implements classification, especially in a concrete implementation, is known as a classifier. The term "classifier" sometimes also refers to the mathematical function, implemented by a classification algorithm, that maps input data to a category. Terminology across fields is quite varied. 
In statistics, where classification is often done with logistic regression or a similar procedure, the properties of observations are termed explanatory variables (or independent variables, regressors, etc.), and the categories to be predicted are known as outcomes, which are considered to be possible values of the dependent variable. In machine learning, the observations are often known as instances, the explanatory variables are termed features (grouped into a feature vector), and the possible categories to be predicted are classes. Other fields may use different terminology: e.g. in community ecology, the term "classification" normally refers to cluster analysis, i.e., a type of unsupervised learning, rather than the supervised learning described in this article.
We will implement, analyze and compare six classification algorithms:
Logistic regression, despite its name, is a linear model for classification rather than regression. Logistic regression is also known in the literature as logit regression, maximum-entropy classification (MaxEnt) or the log-linear classifier. In this model, the probabilities describing the possible outcomes of a single trial are modeled using a logistic function. Logistic regression is implemented in LogisticRegression. This implementation can fit binary, One-vs-Rest, or multinomial logistic regression with optional L1, L2, or Elastic-Net regularization.
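The logistic function mentioned above maps any linear score to a probability in (0, 1); a minimal sketch:

```python
import numpy as np

# The logistic (sigmoid) function; LogisticRegression applies it to the
# linear score w·x + b to obtain class probabilities.
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(0.0))            # → 0.5: a zero score means 50/50
print(round(float(sigmoid(3.0)), 3))  # positive scores push towards class 1
```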

Decision Trees (DTs) are a non-parametric supervised learning method used for classification and regression. The goal is to create a model that predicts the value of a target variable by learning simple decision rules inferred from the data features. For instance, in the example below, decision trees learn from data to approximate a sine curve with a set of if-then-else decision rules. The deeper the tree, the more complex the decision rules and the fitter the model.

Some advantages of decision trees are that they are simple to understand and interpret (trees can be visualised), they require little data preparation, and they can handle both numerical and categorical data.
A random forest is a meta estimator that fits a number of decision tree classifiers on various sub-samples of the dataset and uses averaging to improve the predictive accuracy and control over-fitting. The sub-sample size is always the same as the original input sample size but the samples are drawn with replacement if bootstrap=True (default). In random forests (see RandomForestClassifier and RandomForestRegressor classes), each tree in the ensemble is built from a sample drawn with replacement (i.e., a bootstrap sample) from the training set. Furthermore, when splitting each node during the construction of a tree, the best split is found either from all input features or a random subset of size max_features. (See the parameter tuning guidelines for more details). The purpose of these two sources of randomness is to decrease the variance of the forest estimator. Indeed, individual decision trees typically exhibit high variance and tend to overfit. The injected randomness in forests yield decision trees with somewhat decoupled prediction errors. By taking an average of those predictions, some errors can cancel out. Random forests achieve a reduced variance by combining diverse trees, sometimes at the cost of a slight increase in bias. In practice the variance reduction is often significant hence yielding an overall better model. In contrast to the original publication [B2001], the scikit-learn implementation combines classifiers by averaging their probabilistic prediction, instead of letting each classifier vote for a single class.
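The soft voting described above can be sketched with made-up per-tree probabilities (not taken from any actual fitted forest):

```python
import numpy as np

# Hypothetical class probabilities for one sample from three trees
tree_probs = np.array([[0.9, 0.1],
                       [0.4, 0.6],
                       [0.8, 0.2]])

# Averaging the probabilistic predictions, as scikit-learn's forests do,
# rather than letting each tree cast a single hard vote
forest_prob = tree_probs.mean(axis=0)
print(forest_prob, forest_prob.argmax())
```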
Neighbors-based classification is a type of instance-based learning or non-generalizing learning: it does not attempt to construct a general internal model, but simply stores instances of the training data. Classification is computed from a simple majority vote of the nearest neighbors of each point: a query point is assigned the data class which has the most representatives within the nearest neighbors of the point. scikit-learn implements two different nearest neighbors classifiers: KNeighborsClassifier implements learning based on the k-nearest neighbors of each query point, where k is an integer value specified by the user. RadiusNeighborsClassifier implements learning based on the number of neighbors within a fixed radius r of each training point, where r is a floating-point value specified by the user. The k-neighbors classification in KNeighborsClassifier is the most commonly used technique. The optimal choice of the value k is highly data-dependent: in general a larger k suppresses the effects of noise, but makes the classification boundaries less distinct. In cases where the data is not uniformly sampled, radius-based neighbors classification in RadiusNeighborsClassifier can be a better choice. The user specifies a fixed radius r, such that points in sparser neighborhoods use fewer nearest neighbors for the classification. For high-dimensional parameter spaces, this method becomes less effective due to the so-called “curse of dimensionality”. The basic nearest neighbors classification uses uniform weights: that is, the value assigned to a query point is computed from a simple majority vote of the nearest neighbors. Under some circumstances, it is better to weight the neighbors such that nearer neighbors contribute more to the fit. This can be accomplished through the weights keyword. The default value, weights = 'uniform', assigns uniform weights to each neighbor.
weights = 'distance' assigns weights proportional to the inverse of the distance from the query point. Alternatively, a user-defined function of the distance can be supplied to compute the weights.
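A tiny illustrative example (made-up 1-D points) where the two weighting schemes disagree:

```python
from sklearn.neighbors import KNeighborsClassifier

# Three 1-D training points; the query at 0.5 is much closer to the class-0 point
X = [[0.0], [2.0], [2.1]]
y = [0, 1, 1]
query = [[0.5]]

uniform = KNeighborsClassifier(n_neighbors=3, weights='uniform').fit(X, y)
distance = KNeighborsClassifier(n_neighbors=3, weights='distance').fit(X, y)

# Uniform: two of the three neighbours are class 1, so the majority wins.
# Distance: the single very close class-0 point outweighs the two far ones.
print(uniform.predict(query)[0], distance.predict(query)[0])  # → 1 0
```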
GaussianNB implements the Gaussian Naive Bayes algorithm for classification. The likelihood of the features is assumed to be Gaussian:
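For each feature, given a class, the assumed likelihood is P(x | y) = exp(−(x − μ)² / 2σ²) / √(2πσ²), with μ and σ² estimated per class. A minimal sketch of that density (a hand-written helper, not the scikit-learn internals):

```python
import math

def gaussian_likelihood(x, mean, var):
    # P(x | y) under the per-class Gaussian assumption
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

# The density peaks at the class mean (1/sqrt(2*pi) for mean 0, variance 1)
print(round(gaussian_likelihood(0.0, 0.0, 1.0), 4))  # → 0.3989
```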

Support vector machines (SVMs) are a set of supervised learning methods used for classification, regression and outliers detection. Some advantages of support vector machines are that they are effective in high-dimensional spaces, they remain effective when the number of dimensions is greater than the number of samples, and they are memory efficient because the decision function uses only a subset of the training points (the support vectors).
The implementation of SVC is based on libsvm. The fit time scales at least quadratically with the number of samples and may be impractical beyond tens of thousands of samples. For large datasets consider using sklearn.linear_model.LinearSVC or sklearn.linear_model.SGDClassifier instead, possibly after a sklearn.kernel_approximation.Nystroem transformer. The multiclass support is handled according to a one-vs-one scheme. For details on the precise mathematical formulation of the provided kernel functions and how gamma, coef0 and degree affect each other, see the corresponding section in the narrative documentation: Kernel functions.
First, we split the dataset into a training set and a test set.
y = data["target"].values
x = data.drop(["target"] , axis = 1)
Preprocessing - Scaling the features
# Normalization
from sklearn.preprocessing import StandardScaler
x = StandardScaler().fit_transform(x)
# Train test split
from sklearn.model_selection import train_test_split
x_train , x_test , y_train , y_test = train_test_split(x , y , test_size = 0.2 , random_state = 0)
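As a quick sanity check of what StandardScaler does, its output has (population) mean 0 and standard deviation 1 per column. A sketch with a toy single-feature matrix:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0], [2.0], [3.0], [4.0]])  # toy single-feature data
Xs = StandardScaler().fit_transform(X)
print(Xs.ravel())  # standardized values, symmetric around 0
```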
Now let's try different algorithms using the scikit-learn (formerly scikits.learn) library. It is a free software machine learning library for the Python programming language. It features various classification, regression and clustering algorithms including support vector machines, random forests, gradient boosting, k-means and DBSCAN, and is designed to interoperate with the Python numerical and scientific libraries NumPy and SciPy.
# Logistic Regression
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression()
lr.fit(x_train, y_train)
We can see the resulting linear equation:
print("Logistic Regression linear equation:")
y_string = "y = "
for i in range(13):
    y_string += str(round(lr.coef_[0][i], 4)) + "*x" + str(i) + " + "
y_string += str(round(lr.intercept_[0], 4))
print(y_string)
# Decision Tree
from sklearn.tree import DecisionTreeClassifier
dt = DecisionTreeClassifier()
dt.fit(x_train, y_train)
from sklearn import tree
plt.figure(figsize=(35, 20))
tree.plot_tree(dt)
# Random Forest
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(n_estimators = 27, random_state = 42)
rf.fit(x_train, y_train)
# K-Nearest Neighbors
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors = 4)
knn.fit(x_train, y_train)
# Gaussian Naive Bayes
from sklearn.naive_bayes import GaussianNB
nb = GaussianNB()
nb.fit(x_train, y_train)
# C-Support Vector Classification
from sklearn.svm import SVC
svm = SVC(random_state = 1)
svm.fit(x_train, y_train)
print("Logistic Regression Score : {}".format(lr.score(x_test, y_test)))
print("Decision Tree Score ..... : {}".format(dt.score(x_test, y_test)))
print("Random Forest Score ..... : {}".format(rf.score(x_test, y_test)))
print("Knn Score ............... : {}".format(knn.score(x_test, y_test)))
print("Naive Bayes Score ....... : {}".format(nb.score(x_test, y_test)))
print("SVM Score ............... : {}".format(svm.score(x_test, y_test)))
Now let's try different test sizes, as the test size has a great impact on the measured accuracy of the algorithms.
# Define test sizes
test_size = [0.05, 0.1, 0.15, 0.2, 0.25, 0.3, 0.35, 0.4, 0.45, 0.5, 0.55, 0.6]
lr_acc = [None] * len(test_size)
dt_acc = [None] * len(test_size)
rf_acc = [None] * len(test_size)
knn_acc = [None] * len(test_size)
nb_acc = [None] * len(test_size)
svm_acc = [None] * len(test_size)
Logistic regression with different test sizes
print("Logistic Regression\n")
for j, i in enumerate(test_size):
    x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = i, random_state = 0)
    clf = LogisticRegression()
    clf.fit(x_train, y_train)
    accuracy = clf.score(x_test, y_test)
    lr_acc[j] = accuracy
    print("Accuracy for test size", i, "is", accuracy)
print("\n------------------------------------------------")
Decision Tree with different test sizes
print("Decision Tree\n")
for j, i in enumerate(test_size):
    x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = i, random_state = 0)
    dt = DecisionTreeClassifier()
    dt.fit(x_train, y_train)
    accuracy = dt.score(x_test, y_test)
    dt_acc[j] = accuracy
    print("Accuracy for test size", i, "is", accuracy)
print("\n------------------------------------------------")
Random Forest with different test sizes
print("Random Forest\n")
for j, i in enumerate(test_size):
    x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = i, random_state = 0)
    rf = RandomForestClassifier(n_estimators = 27, random_state = 42)
    rf.fit(x_train, y_train)
    accuracy = rf.score(x_test, y_test)
    rf_acc[j] = accuracy
    print("Accuracy for test size", i, "is", accuracy)
print("\n------------------------------------------------")
K-Nearest Neighbors with different test sizes
print("KNeighbors\n")
for j, i in enumerate(test_size):
    x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = i, random_state = 0)
    knn = KNeighborsClassifier(n_neighbors = 4)
    knn.fit(x_train, y_train)
    accuracy = knn.score(x_test, y_test)
    knn_acc[j] = accuracy
    print("Accuracy for test size", i, "is", accuracy)
print("\n------------------------------------------------")
Naive Bayes with different test sizes
print("Naive Bayes\n")
for j, i in enumerate(test_size):
    x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = i, random_state = 0)
    nb = GaussianNB()
    nb.fit(x_train, y_train)
    accuracy = nb.score(x_test, y_test)
    nb_acc[j] = accuracy
    print("Accuracy for test size", i, "is", accuracy)
print("\n------------------------------------------------")
SVM with different test sizes
print("SVM\n")
for j, i in enumerate(test_size):
    x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = i, random_state = 0)
    svm = SVC(random_state = 1)
    svm.fit(x_train, y_train)
    accuracy = svm.score(x_test, y_test)
    svm_acc[j] = accuracy
    print("Accuracy for test size", i, "is", accuracy)
print("\n------------------------------------------------")
Visualize results for different test sizes and algorithms
plt.figure(figsize=(30, 15))
plt.plot(test_size, lr_acc, label='Logistic Regression')
plt.plot(test_size, rf_acc, label='Random Forest')
plt.plot(test_size, dt_acc, label='Decision Tree')
plt.plot(test_size, knn_acc, label='KNN')
plt.plot(test_size, nb_acc, label='Naive Bayes')
plt.plot(test_size, svm_acc, label='SVM')
plt.xlabel('Test size')
plt.ylabel('Accuracy')
plt.title('Accuracy for different test sizes')
plt.legend()
plt.show()
Let's test different algorithms on different batch sizes. We will compare accuracy and balanced F-score metrics.
The f1_score function computes the F1 score, also known as the balanced F-score or F-measure.
The F1 score can be interpreted as a weighted average of the precision and recall, where an F1 score reaches its best value at 1 and worst score at 0. The relative contributions of precision and recall to the F1 score are equal. The formula for the F1 score is: F1 = 2 * (precision * recall) / (precision + recall)
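With illustrative precision and recall values, the formula gives:

```python
# F1 is the harmonic mean of precision and recall (illustrative values)
precision, recall = 0.8, 0.6
f1 = 2 * (precision * recall) / (precision + recall)
print(round(f1, 4))  # → 0.6857
```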
from sklearn.metrics import accuracy_score
from sklearn.metrics import f1_score
acc_list_lr_acc = []
acc_list_lr_f1 = []
acc_list_dt_acc = []
acc_list_dt_f1 = []
acc_list_rf_acc = []
acc_list_rf_f1 = []
acc_list_knn_acc = []
acc_list_knn_f1 = []
acc_list_nb_acc = []
acc_list_nb_f1 = []
acc_list_svm_acc = []
acc_list_svm_f1 = []
train_batch = [5, 10, 15, 25, 50, 75, 100, 125, 150, 175, 200, 250]
for train_size in train_batch:
    X_train = x[:train_size]
    Y_train = y[:train_size]
    X_test = x[train_size:train_size + 50]
    Y_test = y[train_size:train_size + 50]
    # Logistic Regression
    lr = LogisticRegression()
    lr.fit(X_train, Y_train)
    acc_list_lr_acc.append(accuracy_score(lr.predict(X_test), Y_test))
    acc_list_lr_f1.append(f1_score(lr.predict(X_test), Y_test, average='binary'))
    # Decision Tree
    dt = DecisionTreeClassifier()
    dt.fit(X_train, Y_train)
    acc_list_dt_acc.append(accuracy_score(dt.predict(X_test), Y_test))
    acc_list_dt_f1.append(f1_score(dt.predict(X_test), Y_test, average='binary'))
    # Random Forest
    rf = RandomForestClassifier(n_estimators = 27, random_state = 42)
    rf.fit(X_train, Y_train)
    acc_list_rf_acc.append(accuracy_score(rf.predict(X_test), Y_test))
    acc_list_rf_f1.append(f1_score(rf.predict(X_test), Y_test, average='binary'))
    # K-Nearest Neighbors
    knn = KNeighborsClassifier(n_neighbors = 4)
    knn.fit(X_train, Y_train)
    acc_list_knn_acc.append(accuracy_score(knn.predict(X_test), Y_test))
    acc_list_knn_f1.append(f1_score(knn.predict(X_test), Y_test, average='binary'))
    # Gaussian Naive Bayes
    nb = GaussianNB()
    nb.fit(X_train, Y_train)
    acc_list_nb_acc.append(accuracy_score(nb.predict(X_test), Y_test))
    acc_list_nb_f1.append(f1_score(nb.predict(X_test), Y_test, average='binary'))
    # C-Support Vector Classification
    svm = SVC(random_state = 1)
    svm.fit(X_train, Y_train)
    acc_list_svm_acc.append(accuracy_score(svm.predict(X_test), Y_test))
    acc_list_svm_f1.append(f1_score(svm.predict(X_test), Y_test, average='binary'))
    print("Train with train size", train_size)
    print("Logistic Regression F-measure binary {:3.4f}".format(acc_list_lr_f1[-1]))
    print("Logistic Regression Calculated accuracy: {:3.4f}%\n".format(acc_list_lr_acc[-1] * 100))
    print("Decision Tree F-measure binary {:3.4f}".format(acc_list_dt_f1[-1]))
    print("Decision Tree Calculated accuracy: {:3.4f}%\n".format(acc_list_dt_acc[-1] * 100))
    print("Random Forest F-measure binary {:3.4f}".format(acc_list_rf_f1[-1]))
    print("Random Forest Calculated accuracy: {:3.4f}%\n".format(acc_list_rf_acc[-1] * 100))
    print("K-Nearest Neighbors F-measure binary {:3.4f}".format(acc_list_knn_f1[-1]))
    print("K-Nearest Neighbors Calculated accuracy: {:3.4f}%\n".format(acc_list_knn_acc[-1] * 100))
    print("Gaussian Naive Bayes F-measure binary {:3.4f}".format(acc_list_nb_f1[-1]))
    print("Gaussian Naive Bayes Calculated accuracy: {:3.4f}%\n".format(acc_list_nb_acc[-1] * 100))
    print("C-Support Vector Classification F-measure binary {:3.4f}".format(acc_list_svm_f1[-1]))
    print("C-Support Vector Classification Calculated accuracy: {:3.4f}%\n".format(acc_list_svm_acc[-1] * 100))
    print("-----------------------------------------------------------------")
plt.figure(figsize=(35, 20))
plt.title("Dependency of accuracy on batch size")
plt.plot(train_batch, acc_list_lr_acc, label='Logistic Regression')
plt.plot(train_batch, acc_list_rf_acc, label='Random Forest')
plt.plot(train_batch, acc_list_dt_acc, label='Decision Tree')
plt.plot(train_batch, acc_list_knn_acc, label='KNN')
plt.plot(train_batch, acc_list_nb_acc, label='Naive Bayes')
plt.plot(train_batch, acc_list_svm_acc, label='SVM')
plt.xlabel('Batch size')
plt.ylabel('Accuracy')
plt.legend()
plt.show()
plt.figure(figsize=(35, 20))
plt.title("Dependency of F-measure on batch size")
plt.plot(train_batch, acc_list_lr_f1, label='Logistic Regression')
plt.plot(train_batch, acc_list_rf_f1, label='Random Forest')
plt.plot(train_batch, acc_list_dt_f1, label='Decision Tree')
plt.plot(train_batch, acc_list_knn_f1, label='KNN')
plt.plot(train_batch, acc_list_nb_f1, label='Naive Bayes')
plt.plot(train_batch, acc_list_svm_f1, label='SVM')
plt.xlabel('Batch size')
plt.ylabel('F-measure')
plt.legend()
plt.show()
As the figure shows, the best algorithm is Logistic Regression: its accuracy increases over almost the entire range. A training-set size of about 125 is enough to reach roughly 90% accuracy, so let's examine this model more closely.
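The manual prefix-slicing above can also be reproduced with scikit-learn's built-in `learning_curve` helper, which cross-validates each training size instead of using a single fixed split. This is only a sketch: it uses synthetic data from `make_classification` as a stand-in for the notebook's `x`/`y` arrays (assumed here to have 13 features and a binary target).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

# Synthetic stand-in for the notebook's x/y (13 features, binary target)
X, y = make_classification(n_samples=300, n_features=13, random_state=42)

# Score Logistic Regression at 5 training-set sizes with 5-fold CV
train_sizes, train_scores, test_scores = learning_curve(
    LogisticRegression(max_iter=1000), X, y,
    train_sizes=np.linspace(0.1, 1.0, 5), cv=5, scoring='accuracy')

print(train_sizes)               # absolute training-set sizes used
print(test_scores.mean(axis=1))  # mean cross-validated accuracy per size
```

Because each size is scored on several folds, the resulting curve is smoother and less sensitive to which rows happen to land in the test slice.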
res_acc_list_lr_acc = []
res_acc_list_lr_f1 = []
res_train_size = 125
train_data_xx = x[:res_train_size]
train_data_yy = y[:res_train_size]
XX_train = train_data_xx
YY_train = train_data_yy
res_test_batch = [10, 15, 25, 30, 75, 100, 125, 150, 175, 200, 225, 250, 300]
for res_test_size in res_test_batch:
test_data_xx = x[:res_test_size]
test_data_yy = y[:res_test_size]
XX_test = test_data_xx
YY_test = test_data_yy
# Logistic Regression
res_lr = LogisticRegression()
res_lr.fit(XX_train, YY_train)
res_acc_list_lr_acc.append(accuracy_score(res_lr.predict(XX_test),YY_test))
res_acc_list_lr_f1.append(f1_score(res_lr.predict(XX_test), YY_test, average='binary'))
print("Ttest with test size:", res_test_size)
print("Logistic Regression calculated F-measure binary {:3.4f}".format(f1_score(res_lr.predict(XX_test), YY_test, average='binary')))
print("Logistic Regression calculated accuracy: {:3.4f}%\n".format(accuracy_score(res_lr.predict(XX_test),YY_test)*100))
print("-----------------------------------------------------------------")
plt.figure(figsize=(35, 20))
plt.title("test")
plt.plot(res_test_batch, res_acc_list_lr_acc, label='Logistic Regression Accuracy')
plt.plot(res_test_batch, res_acc_list_lr_f1, label='Logistic Regression F-measure')
plt.xlabel('Test size')
plt.ylabel('Score')
plt.legend()
plt.show()
print("Logistic Regression linear equation:")
res_y_string = "y = "
for i in range(len(res_lr.coef_[0])):
res_y_string += str(round(res_lr.coef_[0][i], 4))+"*x"+str(i)+" + "
res_y_string += str(round(res_lr.intercept_[0], 4))
print(res_y_string)
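The printed equation is the model's decision function; applying the sigmoid to it gives the predicted probability of class 1. A minimal sketch of that relationship, using a model trained on synthetic data as a stand-in for the notebook's fitted `res_lr`:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the notebook's training data and fitted model
X, y = make_classification(n_samples=200, n_features=13, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X, y)

row = X[0]
z = model.coef_[0] @ row + model.intercept_[0]  # the printed linear equation
p = 1.0 / (1.0 + np.exp(-z))                    # sigmoid -> P(class 1)

# Matches scikit-learn's own probability estimate:
print(np.isclose(p, model.predict_proba(row.reshape(1, -1))[0, 1]))  # True
```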
Let's test our model on a specific patient:
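A sketch of scoring a single hypothetical patient. The feature order follows the standard UCI heart-disease columns; the patient's values are invented for illustration, and the model is again trained on synthetic data as a stand-in for the notebook's fitted classifier.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the notebook's training data and fitted model
X, y = make_classification(n_samples=200, n_features=13, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X, y)

# age, sex, cp, trestbps, chol, fbs, restecg, thalach, exang,
# oldpeak, slope, ca, thal  (hypothetical values)
patient = np.array([[57, 1, 0, 140, 241, 0, 1, 123, 1, 0.2, 1, 0, 3]])

print(model.predict(patient))        # predicted class (0 or 1)
print(model.predict_proba(patient))  # probability of each class
```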
acc_list_lr_acc = []
acc_list_lr_f1 = []
acc_list_dt_acc = []
acc_list_dt_f1 = []
acc_list_rf_acc = []
acc_list_rf_f1 = []
acc_list_knn_acc = []
acc_list_knn_f1 = []
acc_list_nb_acc = []
acc_list_nb_f1 = []
acc_list_svm_acc = []
acc_list_svm_f1 = []
train_batch = np.arange(15, 250, 15)
test_pred_size = 15
print(train_batch)
for train_size in train_batch:
train_data_x = x[:train_size]
test_data_x = x[train_size:train_size + test_pred_size]
train_data_y = y[:train_size]
test_data_y = y[train_size:train_size + test_pred_size]
X_train = train_data_x
Y_train = train_data_y
X_test = test_data_x
Y_test = test_data_y
# Logistic Regression
lr = LogisticRegression()
lr.fit(X_train, Y_train)
acc_list_lr_acc.append(accuracy_score(lr.predict(X_test),Y_test))
acc_list_lr_f1.append(f1_score(lr.predict(X_test), Y_test, average='binary'))
# Decision Tree
from sklearn.tree import DecisionTreeClassifier
dt = DecisionTreeClassifier()
dt.fit(X_train, Y_train)
acc_list_dt_acc.append(accuracy_score(dt.predict(X_test),Y_test))
acc_list_dt_f1.append(f1_score(dt.predict(X_test), Y_test, average='binary'))
# Random Forest
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(n_estimators = 27, random_state = 42)
rf.fit(X_train, Y_train)
acc_list_rf_acc.append(accuracy_score(rf.predict(X_test),Y_test))
acc_list_rf_f1.append(f1_score(rf.predict(X_test), Y_test, average='binary'))
# K-Nearest Neighbors
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors = 4)
knn.fit(X_train, Y_train)
acc_list_knn_acc.append(accuracy_score(knn.predict(X_test),Y_test))
acc_list_knn_f1.append(f1_score(knn.predict(X_test), Y_test, average='binary'))
# Gaussian Naive Bayes
from sklearn.naive_bayes import GaussianNB
nb = GaussianNB()
nb.fit(X_train, Y_train)
acc_list_nb_acc.append(accuracy_score(nb.predict(X_test),Y_test))
acc_list_nb_f1.append(f1_score(nb.predict(X_test), Y_test, average='binary'))
# C-Support Vector Classification
from sklearn.svm import SVC
svm = SVC(random_state = 1)
svm.fit(X_train, Y_train)
acc_list_svm_acc.append(accuracy_score(svm.predict(X_test),Y_test))
acc_list_svm_f1.append(f1_score(svm.predict(X_test), Y_test, average='binary'))
plt.figure(figsize=(35, 20))
plt.plot(train_batch, acc_list_lr_acc, label='Logistic Regression')
plt.plot(train_batch, acc_list_rf_acc, label='Random Forest')
plt.plot(train_batch, acc_list_dt_acc, label='Decision Tree')
plt.plot(train_batch, acc_list_knn_acc, label='KNN')
plt.plot(train_batch, acc_list_nb_acc, label='Naive Bayes')
plt.plot(train_batch, acc_list_svm_acc, label='SVM')
plt.xlabel('Batch size')
plt.ylabel('Accuracy')
plt.title('Dependency of accuracy on batch size, test size: 15, delta: 15')
plt.legend()
plt.show()
Test size: 15, delta: 15.
The best model is Logistic Regression; an appropriate batch size is about 105.
The worst model is the Decision Tree.
A batch size of about 105 is enough to reach 0.8 accuracy for almost all models.
acc_list_lr_acc = []
acc_list_lr_f1 = []
acc_list_dt_acc = []
acc_list_dt_f1 = []
acc_list_rf_acc = []
acc_list_rf_f1 = []
acc_list_knn_acc = []
acc_list_knn_f1 = []
acc_list_nb_acc = []
acc_list_nb_f1 = []
acc_list_svm_acc = []
acc_list_svm_f1 = []
train_batch = np.arange(15, 250, 15)
test_pred_size = 30
print(train_batch)
for train_size in train_batch:
train_data_x = x[:train_size]
test_data_x = x[train_size:train_size + test_pred_size]
train_data_y = y[:train_size]
test_data_y = y[train_size:train_size + test_pred_size]
X_train = train_data_x
Y_train = train_data_y
X_test = test_data_x
Y_test = test_data_y
# Logistic Regression
lr = LogisticRegression()
lr.fit(X_train, Y_train)
acc_list_lr_acc.append(accuracy_score(lr.predict(X_test),Y_test))
acc_list_lr_f1.append(f1_score(lr.predict(X_test), Y_test, average='binary'))
# Decision Tree
from sklearn.tree import DecisionTreeClassifier
dt = DecisionTreeClassifier()
dt.fit(X_train, Y_train)
acc_list_dt_acc.append(accuracy_score(dt.predict(X_test),Y_test))
acc_list_dt_f1.append(f1_score(dt.predict(X_test), Y_test, average='binary'))
# Random Forest
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(n_estimators = 27, random_state = 42)
rf.fit(X_train, Y_train)
acc_list_rf_acc.append(accuracy_score(rf.predict(X_test),Y_test))
acc_list_rf_f1.append(f1_score(rf.predict(X_test), Y_test, average='binary'))
# K-Nearest Neighbors
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors = 4)
knn.fit(X_train, Y_train)
acc_list_knn_acc.append(accuracy_score(knn.predict(X_test),Y_test))
acc_list_knn_f1.append(f1_score(knn.predict(X_test), Y_test, average='binary'))
# Gaussian Naive Bayes
from sklearn.naive_bayes import GaussianNB
nb = GaussianNB()
nb.fit(X_train, Y_train)
acc_list_nb_acc.append(accuracy_score(nb.predict(X_test),Y_test))
acc_list_nb_f1.append(f1_score(nb.predict(X_test), Y_test, average='binary'))
# C-Support Vector Classification
from sklearn.svm import SVC
svm = SVC(random_state = 1)
svm.fit(X_train, Y_train)
acc_list_svm_acc.append(accuracy_score(svm.predict(X_test),Y_test))
acc_list_svm_f1.append(f1_score(svm.predict(X_test), Y_test, average='binary'))
plt.figure(figsize=(35, 20))
plt.plot(train_batch, acc_list_lr_acc, label='Logistic Regression')
plt.plot(train_batch, acc_list_rf_acc, label='Random Forest')
plt.plot(train_batch, acc_list_dt_acc, label='Decision Tree')
plt.plot(train_batch, acc_list_knn_acc, label='KNN')
plt.plot(train_batch, acc_list_nb_acc, label='Naive Bayes')
plt.plot(train_batch, acc_list_svm_acc, label='SVM')
plt.xlabel('Batch size')
plt.ylabel('Accuracy')
plt.title('Dependency of accuracy on batch size, test size: 30, delta: 15')
plt.legend()
plt.show()
Test size: 30, delta: 15.
The best model is again Logistic Regression; a batch size of about 90 is enough to reach 87% accuracy.
The worst model is the Decision Tree.
A batch size of about 90 is enough to reach 0.8 accuracy for almost all models.
acc_list_lr_acc = []
acc_list_lr_f1 = []
acc_list_dt_acc = []
acc_list_dt_f1 = []
acc_list_rf_acc = []
acc_list_rf_f1 = []
acc_list_knn_acc = []
acc_list_knn_f1 = []
acc_list_nb_acc = []
acc_list_nb_f1 = []
acc_list_svm_acc = []
acc_list_svm_f1 = []
train_batch = np.arange(15, 240, 15)
test_pred_size = 60
print(train_batch)
for train_size in train_batch:
train_data_x = x[:train_size]
test_data_x = x[train_size:train_size + test_pred_size]
train_data_y = y[:train_size]
test_data_y = y[train_size:train_size + test_pred_size]
X_train = train_data_x
Y_train = train_data_y
X_test = test_data_x
Y_test = test_data_y
# Logistic Regression
lr = LogisticRegression()
lr.fit(X_train, Y_train)
acc_list_lr_acc.append(accuracy_score(lr.predict(X_test),Y_test))
acc_list_lr_f1.append(f1_score(lr.predict(X_test), Y_test, average='binary'))
# Decision Tree
from sklearn.tree import DecisionTreeClassifier
dt = DecisionTreeClassifier()
dt.fit(X_train, Y_train)
acc_list_dt_acc.append(accuracy_score(dt.predict(X_test),Y_test))
acc_list_dt_f1.append(f1_score(dt.predict(X_test), Y_test, average='binary'))
# Random Forest
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(n_estimators = 27, random_state = 42)
rf.fit(X_train, Y_train)
acc_list_rf_acc.append(accuracy_score(rf.predict(X_test),Y_test))
acc_list_rf_f1.append(f1_score(rf.predict(X_test), Y_test, average='binary'))
# K-Nearest Neighbors
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors = 4)
knn.fit(X_train, Y_train)
acc_list_knn_acc.append(accuracy_score(knn.predict(X_test),Y_test))
acc_list_knn_f1.append(f1_score(knn.predict(X_test), Y_test, average='binary'))
# Gaussian Naive Bayes
from sklearn.naive_bayes import GaussianNB
nb = GaussianNB()
nb.fit(X_train, Y_train)
acc_list_nb_acc.append(accuracy_score(nb.predict(X_test),Y_test))
acc_list_nb_f1.append(f1_score(nb.predict(X_test), Y_test, average='binary'))
# C-Support Vector Classification
from sklearn.svm import SVC
svm = SVC(random_state = 1)
svm.fit(X_train, Y_train)
acc_list_svm_acc.append(accuracy_score(svm.predict(X_test),Y_test))
acc_list_svm_f1.append(f1_score(svm.predict(X_test), Y_test, average='binary'))
plt.figure(figsize=(35, 20))
plt.plot(train_batch, acc_list_lr_acc, label='Logistic Regression')
plt.plot(train_batch, acc_list_rf_acc, label='Random Forest')
plt.plot(train_batch, acc_list_dt_acc, label='Decision Tree')
plt.plot(train_batch, acc_list_knn_acc, label='KNN')
plt.plot(train_batch, acc_list_nb_acc, label='Naive Bayes')
plt.plot(train_batch, acc_list_svm_acc, label='SVM')
plt.xlabel('Batch size')
plt.ylabel('Accuracy')
plt.title('Dependency of accuracy on batch size, test size: 60, delta: 15')
plt.legend()
plt.show()
Test size: 60, delta: 15.
The best model is again Logistic Regression; a batch size of about 90 is enough to reach 87% accuracy.
The worst model is the Decision Tree.
A batch size of about 90 is enough to reach 0.8 accuracy for almost all models.
acc_list_lr_acc = []
acc_list_lr_f1 = []
acc_list_dt_acc = []
acc_list_dt_f1 = []
acc_list_rf_acc = []
acc_list_rf_f1 = []
acc_list_knn_acc = []
acc_list_knn_f1 = []
acc_list_nb_acc = []
acc_list_nb_f1 = []
acc_list_svm_acc = []
acc_list_svm_f1 = []
train_batch = np.arange(15, 200, 15)
test_pred_size = 100
print(train_batch)
for train_size in train_batch:
train_data_x = x[:train_size]
test_data_x = x[train_size:train_size + test_pred_size]
train_data_y = y[:train_size]
test_data_y = y[train_size:train_size + test_pred_size]
X_train = train_data_x
Y_train = train_data_y
X_test = test_data_x
Y_test = test_data_y
# Logistic Regression
lr = LogisticRegression()
lr.fit(X_train, Y_train)
acc_list_lr_acc.append(accuracy_score(lr.predict(X_test),Y_test))
acc_list_lr_f1.append(f1_score(lr.predict(X_test), Y_test, average='binary'))
# Decision Tree
from sklearn.tree import DecisionTreeClassifier
dt = DecisionTreeClassifier()
dt.fit(X_train, Y_train)
acc_list_dt_acc.append(accuracy_score(dt.predict(X_test),Y_test))
acc_list_dt_f1.append(f1_score(dt.predict(X_test), Y_test, average='binary'))
# Random Forest
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(n_estimators = 27, random_state = 42)
rf.fit(X_train, Y_train)
acc_list_rf_acc.append(accuracy_score(rf.predict(X_test),Y_test))
acc_list_rf_f1.append(f1_score(rf.predict(X_test), Y_test, average='binary'))
# K-Nearest Neighbors
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors = 4)
knn.fit(X_train, Y_train)
acc_list_knn_acc.append(accuracy_score(knn.predict(X_test),Y_test))
acc_list_knn_f1.append(f1_score(knn.predict(X_test), Y_test, average='binary'))
# Gaussian Naive Bayes
from sklearn.naive_bayes import GaussianNB
nb = GaussianNB()
nb.fit(X_train, Y_train)
acc_list_nb_acc.append(accuracy_score(nb.predict(X_test),Y_test))
acc_list_nb_f1.append(f1_score(nb.predict(X_test), Y_test, average='binary'))
# C-Support Vector Classification
from sklearn.svm import SVC
svm = SVC(random_state = 1)
svm.fit(X_train, Y_train)
acc_list_svm_acc.append(accuracy_score(svm.predict(X_test),Y_test))
acc_list_svm_f1.append(f1_score(svm.predict(X_test), Y_test, average='binary'))
plt.figure(figsize=(35, 20))
plt.plot(train_batch, acc_list_lr_acc, label='Logistic Regression')
plt.plot(train_batch, acc_list_rf_acc, label='Random Forest')
plt.plot(train_batch, acc_list_dt_acc, label='Decision Tree')
plt.plot(train_batch, acc_list_knn_acc, label='KNN')
plt.plot(train_batch, acc_list_nb_acc, label='Naive Bayes')
plt.plot(train_batch, acc_list_svm_acc, label='SVM')
plt.xlabel('Batch size')
plt.ylabel('Accuracy')
plt.title('Dependency of accuracy on batch size, test size: 100, delta: 15')
plt.legend()
plt.show()
Test size: 100, delta: 15.
The best model is again Logistic Regression; a batch size of about 90 is enough to reach 85% accuracy. Naive Bayes is also reasonable here: it performs well once the batch size exceeds 150.
The worst model is again the Decision Tree.
Here, however, a batch size of about 150 is needed to reach 0.8 accuracy for almost all models.
As a result, we have explored the medical Heart Disease UCI dataset.
What did we do, and what did we find?